LitLin 19_4 453-475 fqh034 FIN
Abstract
Delta, a simple measure of the difference between two texts, has been proposed by John F. Burrows as a tool in authorship attribution problems, particularly in large ‘open’ problems in which conventional methods of attribution are not able to limit the claimants effectively. This paper tests Delta’s effectiveness and accuracy, and shows that it works nearly as well on prose as it does on poetry. It also shows that using much larger numbers of frequent words gives even more accurate results than the 150 that Burrows tested. Automated methods that allow for tests on large numbers of differently selected words show that removing personal pronouns and words for which a single text supplies most of the occurrences greatly increases the accuracy of Delta tests. Further tests suggest that large changes in Delta and Delta z-scores from the likeliest to the second likeliest author typically characterize correct attributions, that differences in point of view among the texts are more significant than differences in nationality, and that combining several texts for each author in the primary set reduces the effect of intra-author variability. Although Delta occasionally produces errors in attribution with characteristics that would normally lead to a great deal of confidence, the results presented here confirm its usefulness in the preliminary stages of authorship attribution problems.
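The abstract does not spell out how Delta is computed. As commonly described, Burrows's Delta is the mean absolute difference between the z-scores of the most frequent words in two texts. The following is a minimal sketch under that description, not the paper's own implementation. The function names, the whitespace tokenizer, the use of the population standard deviation, and the `exclude` parameter (standing in for the removal of personal pronouns mentioned above) are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word in a text.

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def burrows_delta(test_text, candidates, n_words=150, exclude=frozenset()):
    """Return {author: Delta} for a test text against candidate texts.

    candidates: dict mapping author name to that author's text(s) joined
    into one string. exclude: hypothetical set of words to drop, e.g.
    personal pronouns. Lower Delta means a closer match.
    """
    # Pick the most frequent words across the whole primary set.
    pooled = Counter()
    for text in candidates.values():
        pooled.update(text.lower().split())
    vocab = [w for w, _ in pooled.most_common() if w not in exclude][:n_words]

    # Per-author relative frequencies, then per-word mean and spread.
    profiles = {a: relative_freqs(t, vocab) for a, t in candidates.items()}
    columns = list(zip(*profiles.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]

    # Words with no variation across authors cannot be standardized.
    keep = [i for i, s in enumerate(sigmas) if s > 0]

    def z_scores(freqs):
        return [(freqs[i] - mus[i]) / sigmas[i] for i in keep]

    z_test = z_scores(relative_freqs(test_text, vocab))
    return {
        author: mean(abs(x - y) for x, y in zip(z_test, z_scores(freqs)))
        for author, freqs in profiles.items()
    }


# Toy usage: the candidate with the smallest Delta is the likeliest author.
toy_candidates = {
    "A": "the cat and the dog and the bird",
    "B": "a cat or a dog or a bird",
}
toy_scores = burrows_delta("the fox and the hen and the owl", toy_candidates)
likeliest = min(toy_scores, key=toy_scores.get)
```

In practice the gap between the smallest and second-smallest Delta matters as much as the ranking itself, since the abstract reports that a large gap typically characterizes a correct attribution.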
Similar resources
LitLin 18_4 423-447 fqh009 FIN
Large, real-world data sets have been investigated in the context of Authorship Attribution of real-world documents. Ngram measures can be used to accurately assign authorship for long documents such as novels. A number of 5 (authors) × 5 (movies) arrays of movie reviews were acquired from the Internet Movie Database. Both ngram and naive Bayes classifiers were used to classify along both the au...
LitLin 18_4 361-378 fqh002 FIN
This paper presents the newly released Lancaster Corpus of Mandarin Chinese (LCMC), a Chinese match for the FLOB and Frown corpora of British and American English. We first discuss the major decisions we took when building the corpus. These relate to sampling, text collection, mark-up, and annotation. Following from this we use the corpus to study aspect marking in Chinese and British/American ...
Table DP-1. Profile of General Demographic Characteristics: 2000 Geographic area: Houston County, Tennessee
Under 5 years       541   6.7
5 to 9 years        554   6.8
10 to 14 years      564   7.0
15 to 19 years      475   5.9
20 to 24 years      ...
Health of Calcutta during the First Quarter of 1878
possible improvement in the accuracy and completeness of registration, but the difference is too material and marked to be explained by such a hypothesis. The increase is apparent under all the diseases specified in the Health Officer's tables save one. Under fevers the numbers are 1,272 against 929; cholera 475 against 644; bowel complaints (dysentery and diarrhoea) 570 against 453; small-p...